Assignment - VAST Challenge MC2

This post presents the step-by-step instructions that were followed to identify which GASTech employees made which purchases and to identify suspicious patterns of behavior using visual analytics techniques.

Mayurapriyann Arulmozhi Baskaran https://www.linkedin.com/in/mayurapriyann/
07-25-2021

Overview

Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos for the past twenty years. The company has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been successful in demonstrating environmental stewardship.

In January 2014, the leaders of GAStech are celebrating their newfound fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected of the disappearance.

Objective

Most of the employees of GASTech are provided with company cars approved for both personal and business use. Those who do not have company cars are given access to company trucks, which cannot be used for personal purposes.

The company cars are of much higher quality than the cars the employees could otherwise afford. However, GASTech has installed geospatial tracking software in the company vehicles to track their movements periodically. The vehicle data is available for the two weeks prior to the disappearance.

In addition to the geospatial data, Kronos-based companies provide a benefits card that gives employees discounts and rewards in exchange for collecting data on their credit card purchases and preferences. The objective of the analysis is to identify suspicious patterns of behavior among the employees and make recommendations for further investigation.

Literature review

A literature review was conducted to examine how earlier analyses with similar objectives were performed and whether the techniques used in the past are applicable to this assignment, with or without enhancements. Issues identified in the previous analyses are also discussed, along with alternative approaches to overcome them. The mini-challenge published in 2014 was similar, so a few of its submissions were reviewed before starting this assignment.

  1. In the submission by Middlesex University, the credit card and loyalty card usage by GASTech employees was visualized with a 3D bar plot, from which it can be inferred that truck drivers tend to spend the most compared to other employees. However, it is advisable to avoid 3D effects unless the data is genuinely three-dimensional; clearer visualizations can be achieved with interactive 2D plots. Most of the analyses were performed with custom-built analytical tools that are not accessible to the general public, so this assignment is built on R packages to cater to a wide segment of the audience, mainly for script reproducibility.
  2. In the submission by KU Leuven University, the daily routine is visualized in an interesting way by grouping it by department type and further faceting by day of the month. A box plot is used to find the outliers across all the locations where the credit cards were used. Though it is spaced well enough for a clear view, the location of an outlier cannot be pinpointed quickly because there are many locations. To overcome this issue, this assignment adds interactivity via tooltips: hovering the mouse cursor over an outlier shows the location where the transaction took place, along with the minimum, 25th percentile, median, 75th percentile and maximum values.
  3. In the submission by the University of Buenos Aires, a heat map with sequential colors is used to visualize the credit card data per hour to surface abnormalities. Though it is visually appealing, the exact number of transactions cannot be read from the plot. This can be overcome by adding interactivity to the heat map so that the transaction counts are shown in tooltips when hovering the cursor over specific cells.

Task 1

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

All the data extraction, wrangling and preparation of the input data required for the analysis are done in R using appropriate packages.

Setting up the environment

Firstly, it is important to install the necessary R packages and load them into the RStudio environment. The code chunk below checks for the required packages, installs any that are missing, and then loads all of them into the RStudio environment.

raster - To read, write, manipulate, analyze, model spatial data
sf - To support simple features and encode spatial vector data
tmap - To visualize spatial data distributions
tidyverse - An opinionated collection of R packages for data science
clock - Provides a comprehensive library for date-time manipulations
rgdal - Works on both raster and vector data types for manipulating geospatial data
tidytext - Makes text mining tasks easier, more effective and consistent
widyr - To widen the matrix, perform some processing, then turn back to tidy form
DT - To render data objects as HTML tables using JavaScript library
dplyr - A fast, consistent tool for working with data frame like objects
hms - To store and format time-of-day values
ggraph - An extension of ggplot2 for graph and network visualizations
igraph - For simple graphs and network analysis
crosstalk - To support linked brushing and filtering
plotly - To create interactive web graphics and custom interface
data.table - To handle data sets in R
stringi - For fast, correct, consistent, and convenient string/text manipulation
mapview - To create interactive visualizations of spatial data
ggridges - Used for visualizing changes in distributions over time
networkD3 - To create interactive network graphs with JavaScript
htmlwidgets - A framework for creating HTML widgets in R

packages = c('raster','sf','tmap','tidyverse','clock','rgdal','tidytext','widyr',
             'DT','dplyr','hms','ggraph','igraph','crosstalk',
             'plotly','data.table','stringi','mapview','ggridges','networkD3',
             'htmlwidgets')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Reading the credit card and loyalty card details

Importing credit card and loyalty card details into two separate data frames.

cc <- read_csv(file = 'data/cc_data.csv')

lc <- read_csv(file = 'data/loyalty_data.csv')

Renaming foreign characters

A single value in the location column contains a non-English character, so the foreign character is replaced with its English equivalent.

setDT(cc)[location %like% "^Katerina", location := "Katerina's Cafe"]
setDT(lc)[location %like% "^Katerina", location := "Katerina's Cafe"]

View the columns of the dataset and their respective attributes.

glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp  <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(lc)
Rows: 1,392
Columns: 4
$ timestamp  <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~

Changing data type and extracting new columns

It can be seen that the timestamp column is of character datatype, hence it is converted to a date-time format. After that, the day, hour and day of the week are extracted from the timestamp column. In addition, the ‘last4ccnum’ column in the credit card dataset is converted from numeric to character format.

cc$timestamp <- date_time_parse(cc$timestamp,
                                zone = "",
                                format = "%m/%d/%Y %H:%M")

lc$timestamp <- date_time_parse(lc$timestamp,
                                zone = "",
                                format = "%m/%d/%Y")


cc <- cc %>%
  mutate(day = get_day(timestamp),
         hour = get_hour(timestamp),
         date = as_date(timestamp))

lc <- lc %>%
  mutate(day = get_day(timestamp))


cc$dayofweek <- weekdays(cc$timestamp)

lc$dayofweek <- weekdays(lc$timestamp)

cc$last4ccnum <- as.character(cc$last4ccnum)
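As a cross-check on the conversion above, the same parsing logic can be sketched in base R (a toy timestamp string; the analysis itself uses clock's date_time_parse and accessor helpers):

```r
# Parse a single timestamp string and pull out the components
# extracted above (toy value, not read from the dataset).
ts <- as.POSIXct("01/06/2014 07:28", format = "%m/%d/%Y %H:%M", tz = "UTC")
as.integer(format(ts, "%H"))  # hour of day: 7
as.integer(format(ts, "%d"))  # day of month: 6
```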

Getting a glimpse after change

View the columns of the dataset and their respective attributes after pre-processing.

glimpse(cc)
Rows: 1,490
Columns: 8
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <chr> "4795", "7108", "6816", "9617", "7384", "5368", "~
$ day        <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ hour       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ dayofweek  <chr> "Monday", "Monday", "Monday", "Monday", "Monday",~
glimpse(lc)
Rows: 1,392
Columns: 6
$ timestamp  <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
$ day        <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ dayofweek  <chr> "Monday", "Monday", "Monday", "Monday", "Monday",~

Usually, a loyalty card is used together with a credit card to get discounts and rewards, but there is a mismatch between the total number of credit card transactions (1,490) and loyalty card transactions (1,392).
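One way to isolate the records behind this mismatch is an anti-join: keep only the credit card rows that have no loyalty counterpart on the same date, at the same location, for the same price. A minimal base-R sketch with hypothetical mini data (the real analysis would use the cc and lc data frames):

```r
# Toy stand-ins for the credit card and loyalty card tables.
cc_toy <- data.frame(date = c("01/06", "01/06", "01/07"),
                     location = c("CafeA", "CafeB", "CafeA"),
                     price = c(11.34, 52.22, 8.33))
lc_toy <- data.frame(date = c("01/06", "01/07"),
                     location = c("CafeA", "CafeA"),
                     price = c(11.34, 8.33))

# Anti-join: credit card rows with no loyalty match on all three keys.
m <- merge(cc_toy, lc_toy, by = c("date", "location", "price"))
cc_only <- cc_toy[!do.call(paste, cc_toy) %in% do.call(paste, m), ]
nrow(cc_only)  # purchases where no loyalty card was presented
```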

Getting the count of transactions made by the employees with credit cards and loyalty cards in order to find the most frequented and least frequented locations.

Building the visualization

cc_bar <- cc %>%
  count(location, sort=TRUE)

lc_bar <- lc %>%
  count(location, sort=TRUE)

Plotting the most frequented and least frequented locations

fig <- plot_ly(cc_bar, x = ~location, y = ~n, type = 'bar',
               hoverinfo = "text", text = ~paste(location, '</br></br>', n),
               marker = list(color = ~n,
                             colorbar = list(title = 'Visits'))) %>%
  layout(title = "Popular locations by credit card transactions",
         xaxis = list(title = "Location", categoryarray = ~location,
                      categoryorder = "array", tickangle = 270),
         yaxis = list(title = "Visits"))
fig

fig <- plot_ly(lc_bar, x = ~location, y = ~n, type = 'bar',
               hoverinfo = "text", text = ~paste(location, '</br></br>', n),
               marker = list(color = ~n,
                             colorbar = list(title = 'Visits'))) %>%
  layout(title = "Popular locations by loyalty card transactions",
         xaxis = list(title = "Location", categoryarray = ~location,
                      categoryorder = "array", tickangle = 270),
         yaxis = list(title = "Visits"))
fig

Comparing the two bar plots of credit card and loyalty card transactions, the most popular locations are Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served. The fifth most popular location by credit card transactions is Hallowed Grounds, but there is an anomaly in loyalty card usage there: the location recorded 92 credit card transactions against only 80 loyalty card transactions. Another suspicious observation from the plots is at Daily Dealz, which recorded only one credit card transaction over the 14-day period, and no loyalty card was used at that location at all.
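The per-location gap underlying these counts can be computed by tabulating both datasets over a common set of locations and differencing; a base-R sketch with toy location vectors standing in for cc$location and lc$location:

```r
# Toy transaction-location vectors (the real analysis would use
# cc$location and lc$location).
cc_loc <- c("Hallowed Grounds", "Hallowed Grounds", "Daily Dealz",
            "Hippokampos")
lc_loc <- c("Hallowed Grounds", "Hippokampos")

# Count both datasets over the same location levels and difference.
locs <- sort(unique(c(cc_loc, lc_loc)))
gap <- data.frame(location = locs,
                  cc_n = as.vector(table(factor(cc_loc, levels = locs))),
                  lc_n = as.vector(table(factor(lc_loc, levels = locs))))
gap$diff <- gap$cc_n - gap$lc_n
gap[gap$diff > 0, ]  # locations where loyalty usage lags credit usage
```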

Now plotting a heatmap to find out the most visited and least visited locations by hour of the day.

cc$location <- stri_trans_general(cc$location, "latin-ascii")
x_axis_labels <- min(cc[,'hour']):max(cc[,'hour'])

p <- group_by(cc,hour,location) %>% summarize(n=n()) %>% 
  ggplot(aes(hour,location,fill=n)) + geom_tile() +
  scale_x_continuous(expand=c(0,0),labels = x_axis_labels, breaks = x_axis_labels) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
  scale_fill_distiller(palette = "Reds", direction = 1, labels = c(0, 20, 40, 60, "> 80")) +
  labs(title = "Most/least frequented locations by hour of the day", y = "Location", x = "Hour")

ggplotly(p)

Hourly popularity can be inferred from the above heatmap.

Morning 7 am - 9 am

  1. Brew’ve Been Served
  2. Hallowed Grounds
  3. Coffee Cameleon
    The names of these locations suggest that they might serve coffee and snacks. The surge in the morning hours may indicate that employees tend to have breakfast or coffee at these locations before heading to GASTech.

Noon 12 pm - 2 pm

  1. Hippokampos
  2. Katerina’s Cafe
  3. Abila Zachora
  4. Gelatogalore
  5. Guy’s Gyros
    This is likely the lunchtime for GASTech employees. The list shows only the top 5 locations between 12 pm and 2 pm; unlike breakfast, which was limited to three highly sought-after outlets, the employees have various options for lunch.

Evening 7 pm - 9 pm

  1. Katerina’s Cafe
  2. Guy’s Gyros
  3. Hippokampos
  4. Frydos Autosupply n’ More
  5. Ouzeri Elian
    This should be dinner time for the employees. The morning and noon periods feature locations related to food and beverages, but between 7 pm and 9 pm there is a noticeable surge in visitors at Frydos Autosupply n’ More.
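The evening surge can be checked numerically by filtering the transactions to the 7 pm - 9 pm window; a toy sketch of the filter (hypothetical mini data in place of cc):

```r
# Toy transactions with an hour-of-day column, as extracted earlier.
tx <- data.frame(location = c("Frydos Autosupply n' More",
                              "Frydos Autosupply n' More",
                              "Katerina's Cafe"),
                 hour = c(19, 8, 20))

# Keep only the 19:00-21:00 window and count visits per location.
evening <- tx[tx$hour >= 19 & tx$hour <= 21, ]
table(evening$location)
```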

Below is a heatmap to find out the most visited and least visited locations by day of the month.

cc$location <- stri_trans_general(cc$location, "latin-ascii")
x_axis_labels <- min(cc[,'day']):max(cc[,'day'])

p <- group_by(cc,day,location) %>% summarize(n=n()) %>% 
  ggplot(aes(day,location,fill=n)) + geom_tile() +
  scale_x_continuous(expand=c(0,0),labels = x_axis_labels, breaks = x_axis_labels) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
  scale_fill_distiller(palette = "Reds", direction = 1) +
  labs(y= "Location", x = "Day Of Month", title = "Most/least frequented locations by day of month")

ggplotly(p)

Over the 14-day period, places such as Katerina’s Cafe, Hippokampos and Guy’s Gyros are popular. Another popular place is Brew’ve Been Served, but it recorded no transactions on either weekend, so we can infer that this outlet is closed on weekends.

Below is a heatmap to find out the most visited and least visited locations by day of the week.

cc$location <- stri_trans_general(cc$location, "latin-ascii")
x_axis_labels <- min(cc[,'day']):max(cc[,'day'])

p <- group_by(cc,dayofweek,location) %>% summarize(n=n()) %>% 
  ggplot(aes(dayofweek,location,fill=n)) + geom_tile() + 
  scale_fill_distiller(palette = "Reds", limits = c(0,10), na.value = "#de2d26",
                       direction = 1, labels = c(0.0, 2.5, 5.0, 7.5, "> 10.0")) +
  labs(y= "Location", x = "Day Of Week", title = "Most/least frequented locations by day of week") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5), panel.grid.major = element_blank())

ggplotly(p)

Weekends tend to have fewer visitors than weekdays, and some outlets had no transactions during the weekends despite a decent number during the week, implying that a few outlets may be closed on weekends.
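The outlets likely closed on weekends can be listed programmatically: take every location, and drop those that record at least one Saturday or Sunday transaction. A base-R sketch with toy data in place of cc:

```r
# Toy transactions with the dayofweek column derived earlier.
tx <- data.frame(location = c("Brew've Been Served", "Brew've Been Served",
                              "Katerina's Cafe", "Katerina's Cafe"),
                 dayofweek = c("Monday", "Tuesday", "Friday", "Saturday"))

# Locations with at least one weekend transaction, then the complement.
wknd <- c("Saturday", "Sunday")
open_weekends <- unique(tx$location[tx$dayofweek %in% wknd])
likely_closed <- setdiff(unique(tx$location), open_weekends)
likely_closed  # outlets with no weekend transactions
```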

Joining credit card data and loyalty card data

Using a left join to merge loyalty card data with credit card data to map all the transactions.

merged_cards <- left_join(cc, lc, by = c("date" = "timestamp", "location" = "location", "price" = "price"))
glimpse(merged_cards)
Rows: 1,496
Columns: 11
$ timestamp   <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-~
$ location    <chr> "Brew've Been Served", "Hallowed Grounds", "Brew~
$ price       <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.~
$ last4ccnum  <chr> "4795", "7108", "6816", "9617", "7384", "5368", ~
$ day.x       <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~
$ hour        <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ~
$ date        <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06,~
$ dayofweek.x <chr> "Monday", "Monday", "Monday", "Monday", "Monday"~
$ loyaltynum  <chr> "L8566", NA, "L8148", "L5553", "L3800", "L2247",~
$ day.y       <int> 6, NA, 6, 6, 6, 6, NA, 6, 6, 6, 6, 6, NA, NA, 6,~
$ dayofweek.y <chr> "Monday", NA, "Monday", "Monday", "Monday", "Mon~

After the left join, the total number of records increased from 1,490 to 1,496, which shows that some credit card owners matched more than one loyalty card.

Below is the list of loyalty cards that were used by more than one credit card owner.

t <- merged_cards %>% 
  na.omit() %>%
  group_by(loyaltynum) %>%
  summarise(count = n_distinct(last4ccnum))

t %>%
  filter(count>1)
# A tibble: 8 x 2
  loyaltynum count
  <chr>      <int>
1 L2070          2
2 L2247          2
3 L3288          2
4 L3295          2
5 L6119          2
6 L6267          2
7 L8566          2
8 L9406          2

Mapping the corresponding credit card details to the loyalty card details and getting the number of transactions performed.

bt <- merged_cards %>%
  filter(loyaltynum %in% c("L2070","L2247","L3288","L3295","L6119",
                           "L6267","L8566","L9406")) %>%
  group_by(last4ccnum, loyaltynum) %>%
  summarise(n=n())

bt
# A tibble: 16 x 3
# Groups:   last4ccnum [10]
   last4ccnum loyaltynum     n
   <chr>      <chr>      <int>
 1 1286       L3288         15
 2 4795       L2070          1
 3 4795       L8566         25
 4 4948       L3295          1
 5 4948       L9406         22
 6 5368       L2247         24
 7 5368       L6119          1
 8 5921       L3295         12
 9 5921       L9406          1
10 6691       L6267         20
11 6899       L6267         23
12 7889       L2247          1
13 7889       L6119         20
14 8332       L2070         27
15 8332       L8566          1
16 9241       L3288         13
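A simple way to resolve these pairs is to keep, for each loyalty card, the credit card it co-occurs with most often, treating the one-off pairings as likely mis-scans. A base-R sketch with toy counts echoing the structure of bt:

```r
# Toy pairing counts (the real analysis would use the bt tibble).
bt_toy <- data.frame(last4ccnum = c("4795", "8332", "4795"),
                     loyaltynum = c("L8566", "L8566", "L2070"),
                     n = c(25, 1, 1))

# For each loyalty card, keep the row with the highest co-occurrence.
dominant <- do.call(rbind, lapply(split(bt_toy, bt_toy$loyaltynum),
                                  function(d) d[which.max(d$n), ]))
dominant[, c("loyaltynum", "last4ccnum")]
```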

Plotting the price distribution using a ridge plot to get a sense of the transaction ranges at different locations. The price column is transformed to a log scale for better visualization.

ggplot(data = cc, aes(x = log(price + 1), y = location,
                      fill = 0.5 - abs(0.5 - stat(ecdf)))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01,
                               gradient_lwd = 1, calc_ecdf = TRUE) +
  theme_ridges(font_size = 12, grid = FALSE) +
  scale_fill_viridis_c(name = "Tail probability", direction = -1) +
  labs(y = "Location",
       x = "Price (log scale)",
       title = "Price distribution by location") +
  theme(axis.title.x = element_text(hjust = 0.5),
        axis.title.y = element_text(hjust = 0.5))

This ridge plot displays the transaction price distribution across all the locations visited by GASTech employees, with price expressed on a log scale for better visualization. Locations such as U-Pump, Maximum Iron and Steel, Kronos Pipe and Irrigation, Carlyle Chemical Inc., Abila Airport, Abila Scrapyard and Nationwide Refinery are where the high-priced transactions took place.
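The heavy right tails in the ridge plot correspond to large per-location maxima, which can be confirmed numerically; a base-R sketch with toy prices standing in for cc$price and cc$location:

```r
# Toy prices by location (the real analysis would use cc).
tx <- data.frame(location = c("U-Pump", "U-Pump", "Katerina's Cafe"),
                 price = c(500, 1200, 18.5))

# Maximum transaction price per location, largest first.
maxima <- tapply(tx$price, tx$location, max)
sort(maxima, decreasing = TRUE)
```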

Below is an interactive data table with the credit card transaction details, which can be utilized at a later stage of the analysis.

DT::datatable(cc,
              filter = 'top') %>%
  formatStyle(columns = 0,
              target = 'row',
              lineHeight = '100%') %>%
  formatDate(1, "toLocaleString")

Granular information can be obtained by referring to the interactive data table to check for anomalies throughout this analysis.

Below is an interactive data table combining the credit card and loyalty card transaction details, showing which credit card and loyalty card pairs were used together.

DT::datatable(merged_cards,
              filter = 'top') %>%
  formatStyle(columns = 0,
              target = 'row',
              lineHeight = '100%') %>%
  formatDate(1, "toLocaleString") %>%
  formatDate(7, "toLocaleDateString")

The data table below shows the unique transaction counts and total transaction counts by location.

z <- cc %>%
  group_by(location) %>%
  summarise(unique = n_distinct(last4ccnum), total=n())

DT::datatable(z,
              filter = 'top') %>%
  formatStyle(columns = 1,
              target = 'row',
              lineHeight = '100%')

Task 2

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

Importing raster file

bgmap <- raster("data/MC2-tourist.tif")
bgmap
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 1595, 2706, 4316070  (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05  (x, y)
extent     : 24.82419, 24.90976, 36.04499, 36.09543  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MC2-tourist.tif 
names      : MC2.tourist 
values     : 0, 255  (min, max)

Plotting Raster Layer

tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

The map of Abila gives a clear picture of how far GASTech is from the other locations visited by the employees. The most frequented locations, such as Katerina’s Cafe, Guy’s Gyros and Brew’ve Been Served, all seem to be in close proximity to GASTech. Hippokampos was the second most frequented location but, surprisingly, is not present on the map. Since the other top frequented locations are close together, Hippokampos is presumably also near GASTech.

Importing vector GIS data file

Abila_st <- st_read(dsn = "data/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `C:\abmayur05\VisualAnalytics\VAST Challenge MC2\data\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84

Importing Aspatial Data

gps <- read_csv("data/gps.csv")
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

Converting Date-Time field and ID field

gps$Timestamp <- date_time_parse(gps$Timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M:%S")
gps$id <- as_factor(gps$id)

Converting Aspatial Data into a Simple Feature Data Frame

gps_sf <- st_as_sf(gps,
                   coords = c("long","lat"),
                   crs = 4326)
gps_sf <- gps_sf %>%
  mutate(day = get_day(Timestamp),
         hour = get_hour(Timestamp),
         dayofweek = date_weekday_factor(Timestamp) ,
         date = as_date(Timestamp))
glimpse(gps_sf)
Rows: 685,169
Columns: 7
$ Timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ id        <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ geometry  <POINT [°]> POINT (24.87469 36.07623), POINT (24.8746 36~
$ day       <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ hour      <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ dayofweek <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ date      <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~

Creating Movement Path From GPS Points

The code chunk below joins the GPS points into movement paths by using car id, date and hour as unique identifiers.

gps_path <- gps_sf %>%
  group_by(id, date, hour) %>%
  summarize(timestamp= mean(Timestamp),
            do_union=FALSE) %>%
  st_cast("LINESTRING")

Finding the orphan lines

After building the movement paths, some orphan GPS points remain, so the orphan lines (paths consisting of a single point) are identified.

p = npts(gps_path, by_feature = TRUE)
gps_path2 <- cbind(gps_path, p)

Removing the orphan lines

gps_path3 <- gps_path2[!(gps_path2$p==1),]

Plotting the GPS Paths

The code chunk below overlays the GPS paths of all car ids onto the background tourist map.

gps_path_selected <- gps_path3
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines(col='red')

The earlier heat map of location versus hour of transaction showed some transactions recorded in the early morning, around 3 am - 4 am, at Kronos Mart. Hence, the GPS paths during that time period are plotted to see whether any car id passed that way.
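A first step is to restrict the GPS records to that early-morning window before mapping them; a toy sketch of the hour filter (hypothetical mini data in place of gps_sf):

```r
# Toy GPS records with the id and hour columns derived earlier.
gps_toy <- data.frame(id = c(1, 1, 2),
                      hour = c(3, 14, 3))

# Keep only the 03:00-04:00 window and list the car ids seen in it.
early <- gps_toy[gps_toy$hour %in% c(3, 4), ]
unique(early$id)  # car ids on the road in the early hours
```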